The 3D architecture of the pepper genome and its relationship to function and evolution

The organization of chromatin into self-interacting domains is universal among eukaryotic genomes, though how and why they form varies considerably. Here we report a chromosome-scale reference genome assembly of pepper (Capsicum annuum) and explore its 3D organization through integrating high-resolution Hi-C maps with epigenomic, transcriptomic, and genetic variation data. Chromatin folding domains in pepper are as prominent as TADs in mammals but exhibit unique characteristics. They tend to coincide with heterochromatic regions enriched with retrotransposons and are frequently embedded in loops, which may correlate with transcription factories. Their boundaries are hotspots for chromosome rearrangements but are otherwise depleted for genetic variation. While chromatin conformation broadly affects transcription variance, it does not predict differential gene expression between tissues. Our results suggest that pepper genome organization is explained by a model of heterochromatin-driven folding promoted by transcription factories and that such spatial architecture is under structural and functional constraints.


Supplementary Method 1. Genome size estimation
To obtain preliminary information on the genomic characteristics before large-scale genome sequencing for the inbred line CA59, we first used G.C.E (Genome Characteristics Estimation)(1.0.2) 1 with ~362.0 Gb BGI short reads (150 bp pair-end) to estimate the genome size, repeat content, and heterozygous rate based on the distribution of 17-mer frequency calculated by kmerfreq. Quality control of BGI sequencing short reads was conducted using Trimmomatic (0.38) 2 .
The Kmer_number is calculated using the following formula: Kmer_number = reads_number * (reads_len -kmer_len + 1). (1) The Heterozygous rate and repeat frequency are calculated using the following formulas: Repeat frequency = 1 -b[½] -b [1] ( 3) where a [½] indicates the ratio of unique k-mers in all the kmer species in the half genome coverage peak of genome coverage peak, b [1] indicates the ratio of unique k-mers in all the kmer individuals in the genome, and b [½] indicates the ratio of unique k-mers in all the k-mers individuals in the half genome coverage peak of genome coverage peak.
The parameters of kmerfreq and G.C.E were set as follows: The G.C.E analysis shows that the estimated genome size of the CA59 line is about 2.95 Gb, the heterozygous rate is about 0.23%, and the repeat frequency of the genome is about 76.17%.

Supplementary Method 2. Genome assembly
De novo assembly of the CA59 genome was carried out as follows: Step 1: Long reads (PacBio) sequencing Extraction of high-molecular-weight DNA from young leaves was carried out using a modified cetyltrimethylammonium bromide (CTAB) method 3 . About 10 μg of genomic DNA was used for preparing template libraries of 30~40-kb using the BluePippin Size Selection system (Sage Science, USA) following the manufacturer's protocol (Pacific Biosciences, USA). The libraries were sequenced on the PacBio SEQUEL II platform with three SMRT flow cells. The summary statistics of raw PacBio long reads are provided in Supplementary Table 1.
Step 2: Filtering out short and low-quality reads To filter out short and low-quality raw PacBio long reads, we sorted all reads based on their length in descending order. We only included the top 200.0 Gb longest reads for genome assembling. The summary statistics of the selected PacBio long reads for genome assembly are provided in Supplementary Table 1.

pilon --threads ${threads} --genome ${draft_genome} --frags ${mapped_bam} --fix snps,indels --output ${polished_genome}
Step 6: Quality evaluation To assess the quality of the genome assembly, we calculated two metrics: BUSCO and the Phred quality score QV value. We used BUSCO (3.02) 7 based on the embryophyta_odb9 data set to assess the completeness of the gene space of the assembly. We mapped BGI short reads to the final polished assembly using Bowtie2 (2.4.4) 8 with the default parameters. Freebayes (1.3.4) 9 was run with the command: freebayes -C 2 -0 -O -q 20 -z 0.10 -E 0 -X -u -p 2 -F 0.75 -b QV_mapping.bam -v QV.vcf -f Capsicum_finalpolsih.fasta The QV was computed as QV = -10log10(B/T), [4] where B was the total number of variant sites (insertions/deletions/SNPs) obtained from the above QV.vcf file, and T is the number of the genome sites with at least 3 mapped reads..

Supplementary Method 5. Short-read RNA-seq data processing
We generated RNA-seq data for 5 tissues, including bud, leave, placenta, pulp, and root. Raw data were preprocessed using Trimmomatic (0.38) 2 to trim adapter sequences and filter out low-quality reads. Clean RNA-seq reads were aligned to the CA59 genome using HISAT2(2.21) 15

Supplementary Method 11. ChIP assay
Grind young leaf (2g) into a fine powder in liquid nitrogen and then crosslinked with 1% formaldehyde for 10 min at room temperature. After sonication, immunoprecipitation was performed with antibodies.
ChIP was performed using antibodies against the following: H3K4me3 (Abcam, ab8580), H3K27me3 (Millipore 07-499), and H3K9me2 (Abcam, ab1220). The immunoprecipitated complex was washed, and DNA was extracted and purified by Universal DNA Purification Kit (QIAquick PCR Purification Kit, 28106). The ChIP-Seq library was prepared using the ChIP-Seq DNA sample preparation kit (NEBNext® Ultra™II DNA) according to the manufacturer's instructions. For ChIP-seq, extracted DNA was ligated to specific adaptors followed by deep sequencing in the Illumina Novaseq 6000 using 150bp paired-end. Samples were prepared and sequenced with assistance from Shanghai Jiayin Biotechnology Co., Ltd.
ChIP-seq mapping and peaks calling were run with the commands:

Supplementary Note 1. Analysis of LTR-RTs in the C. annuum genome
We initially identified 7,074 full-length LTR-RTs in the CA59 assembly, corresponding to an average of 2.5 elements per megabase (Mb). This density is comparable to those also observed from high- TSDs, and 354,257 truncated elements (required to cover at least 80% of the size of the full-length element, or containing at least one LTR) ( Supplementary Fig. 6e). These elements totally account for 45% of the genome, leaving a substantial proportion (~34%) of previously annotated LTR-RT sequence that was not taken into account due to their highly fragmentary structure. Notably, the number of solo-LTRs without TSDs is about four times as those with TSDs suggest that inter-element unequal recombination 21 is more prevalent than intra-element ones in the pepper genome.
Calculating from these newly identified LTR-RT elements, the overall ratio of solo LTRs (S) to intact elements (I) in the pepper genome is 2.52 which is significantly higher than rice 22 , soybean 23 , maize, and other three Solanaceous species. In contrast to previous reports in tomato and rice, the S/I ratio in recombination-suppressed pericentromeric regions (2.80) is slightly higher than that in gene-rich euchromatic regions (2.29) (Supplementary Fig. 6f). To account for this opposing finding, we separated the LTR-RT families into two categories based on their estimated insertion times. We found that young Taken together, our results suggest that it is illegitimate recombination that predominantly drives the rapid decay of LTR-RTs in the pepper genome, especially in the recombination-suppressed pericentromeric regions.

Supplementary Note 2. Analysis of TAD-like domains inferred by TADtool
We also explored the method TADtool, which is based on the insulation index, to identify TAD-like domains for the pepper genome. We used a leaf Hi-C contact matrix which was corrected by the BNBC program for testing and comparing with other methods. Using the optimized parameters (window size: 100 kb and TAD cutoff 2e7), TADtool inferred 2,070 domains, and these domains covered ~75% of the pepper genome ( Supplementary Fig. 12a). About ~66% of domains inferred by TADtool can be also found in the other three methods ( Supplementary Fig. 12b).
We next applied the TADtool to all eight samples and found that domain calls were largely consistent across tissues both in location and size (as shown in the below figure). A hierarchical clustering analysis based on the conservation of domains and boundaries also demonstrated that domain calls were reproducible across tissues and replicates ( Supplementary Fig. 12c,d). Roughly, between 58% and 79% of TAD-like domains, and between 60% and 91% of the boundaries were shared across pairwise sample comparisons ( Supplementary Fig. 12c,d). At least 75% of domains identified in one tissue were also detected in other tissues ( Supplementary Fig. 12e)   Fig. 7a), suggesting they are associated with decreases in gene expression. We repeated the analysis using transcription levels measured in bins (i.e. 40 kb) and obtained similar results (Supplementary Table 14-15 and Supplementary Fig. 18a).
These results suggest that subcompartment patterning has limited effects on differential gene  Step1: PacBio reads were first corrected by MECAT2, and then trimmed and assembled using CANU. In this step, we obtained an original assembly of 623 gapless contigs. b Step2: The resulting contigs were further polished with short reads three times using Pilon. c Step3: Chromosome conformation capture (Hi-C) data were next used to scaffold the polished contigs using Juicer and 3D-DNA pipeline. According to this, the map resolution of six samples is between 5 kb and 10 kb, and the other two samples are slightly lower than 10kbresolution.    Supplementary Fig. 18. The relationship between subcompartment switching and change in gene expression. a, Genomic regions (i.e. 40-kb bins) switching from A to B compartments or from higher subcompartments to lower subcompartments (e.g. from A1.1 to A1.2) show a trend of decreasing expression, and conversely, switching from B to A compartment or from lower subcompartments to higher subcompartments show a trend of increasing expression. Pairwise comparisons of subcompartment shifts for expression profiles across the leaf, bud, pulp, and placenta are shown(supplement to Fig. 7a). Analyses were conducted in two ways: 1) only considered one replicate (i.e. a subcompartment switching event only needed to be supported in the first replicate), and 2) two replicates (i.e. a subcompartment switching event needed to be supported by both replicates).

Supplementary
The expression level was measured in genes or 40-kb bins. The box plots span from 25th to 75th percentile, the center lines show the median, and whiskers extend to 1.5 times IQR. The numbers under lower whiskers indicate the sample size used in the analysis. P values from one-sided Wilcoxon ranked sum tests. b, 40-kb bins with decreased expression were slightly enriched for cases of subcompartment switching from a higher rank to lower ranks, while those with increased expression were slightly enriched for cases of subcompartment switching from lower ranks to higher ranks (supplement to Fig. 7b).